Action Sets: Weakly Supervised Action Segmentation without Ordering Constraints
Action detection and temporal segmentation of actions in videos are topics of
increasing interest. While fully supervised systems have gained much attention
lately, full annotation of each action within the video is costly and
impractical for large amounts of video data. Thus, weakly supervised action
detection and temporal segmentation methods are of great importance. While most
works in this area assume an ordered sequence of occurring actions to be given,
our approach only uses a set of actions. Such action sets provide much less
supervision, since neither the action ordering nor the number of action occurrences
is known. In exchange, they can be obtained easily, for instance from
meta-tags, while ordered sequences still require human annotation. We introduce
a system that automatically learns to temporally segment and label actions in a
video, using only action sets as supervision. An evaluation
on three datasets shows that our method still achieves good results although
the amount of supervision is significantly smaller than for other related
methods.
Comment: CVPR 2018
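The difference between the levels of supervision discussed above can be made concrete with a toy example. This is purely illustrative; the action names below are hypothetical and not taken from any of the evaluated datasets.

# Illustration only: three levels of supervision for one video.

# Full supervision: one action label per frame (costly to annotate).
frame_labels = ["background", "crack_egg", "crack_egg", "fry_egg", "fry_egg"]

# Weaker supervision: an ordered transcript of the occurring actions.
ordered_transcript = ["crack_egg", "fry_egg"]

# Weakest supervision (this work): an unordered action set; neither the
# ordering nor the number of occurrences of each action is known.
action_set = {"crack_egg", "fry_egg"}

assert action_set == set(ordered_transcript)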
Weakly Supervised Action Learning with RNN based Fine-to-coarse Modeling
We present an approach for weakly supervised learning of human actions. Given
a set of videos and an ordered list of the occurring actions, the goal is to
infer start and end frames of the related action classes within the video and
to train the respective action classifiers without any need for hand-labeled
frame boundaries. To address this task, we propose a combination of a
discriminative representation of subactions, modeled by a recurrent neural
network, and a coarse probabilistic model that allows for temporal alignment and
inference over long sequences. While this system alone already generates good
results, we show that the performance can be further improved by tailoring the number of subactions to the characteristics of the different action classes. To this end, we adapt the number of subaction classes by alternating between realignment and reestimation during training. The proposed system is evaluated
on two benchmark datasets, Breakfast and Hollywood Extended, showing competitive performance on various weakly supervised learning tasks such as temporal action segmentation and action alignment.
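A minimal, self-contained sketch of the fine-to-coarse idea described above may help: each action class is split into several subaction labels, frames are (re)aligned to those subactions, and the number of subactions per class is then re-estimated from the observed durations. The uniform linear alignment, the frames_per_sub heuristic, and all toy numbers are assumptions for illustration, not the authors' implementation.

from collections import defaultdict

def linear_alignment(transcript, n_frames, n_sub):
    # Assign frames to subaction units by splitting the video uniformly;
    # a crude stand-in for the probabilistic alignment used in the paper.
    units = [(action, k) for action in transcript for k in range(n_sub[action])]
    per_unit = max(1, n_frames // len(units))
    labels = []
    for unit in units:
        labels.extend([unit] * per_unit)
    labels = labels[:n_frames]
    labels += [units[-1]] * (n_frames - len(labels))
    return labels

def reestimate_subaction_counts(videos, n_sub, frames_per_sub=10):
    # Re-estimate how many subactions each class should have from the
    # average duration that the current alignment assigns to it.
    durations = defaultdict(list)
    for transcript, n_frames in videos:
        labels = linear_alignment(transcript, n_frames, n_sub)
        counts = defaultdict(int)
        for action, _ in labels:
            counts[action] += 1
        for action, total in counts.items():
            durations[action].append(total / transcript.count(action))
    return {action: max(1, round(sum(d) / len(d) / frames_per_sub))
            for action, d in durations.items()}

# Toy data: (ordered transcript, number of frames) per training video.
videos = [(["pour_milk", "stir"], 120), (["pour_milk", "stir", "stir"], 200)]
n_sub = {"pour_milk": 1, "stir": 1}   # start with one subaction per class
for _ in range(3):                    # iterate realignment and reestimation
    n_sub = reestimate_subaction_counts(videos, n_sub)
print(n_sub)                          # e.g. {'pour_milk': 6, 'stir': 6}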
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Large-scale noisy web image-text datasets have proven effective
for learning robust vision-language models. However, when transferring them to
the task of video retrieval, models still need to be fine-tuned on hand-curated
paired text-video data to adapt to the diverse styles of video descriptions. To
address this problem without the need for hand-annotated pairs, we propose a
new setting, text-video retrieval with uncurated & unpaired data, in which
training uses only text queries together with uncurated web videos, without
any paired text-video data. To this end, we propose an approach, In-Style, that
learns the style of the text queries and transfers it to uncurated web videos.
Moreover, to improve generalization, we show that one model can be trained with
multiple text styles. To this end, we introduce a multi-style contrastive
training procedure that improves the generalizability over several datasets
simultaneously. We evaluate retrieval performance on multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval, and we improve state-of-the-art performance on zero-shot text-video retrieval.
Comment: Published at ICCV 2023, code: https://github.com/ninatu/in_style
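As an illustration of what multi-style contrastive training can look like, the sketch below uses a generic symmetric InfoNCE objective (as popularized by CLIP) and simply sums it over batches drawn from different caption styles. The embedding sizes, style names, and summation scheme are assumptions for illustration and do not reproduce the In-Style recipe.

import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, video_emb, temperature=0.07):
    # Symmetric InfoNCE: row i of text_emb is (pseudo-)paired with row i of video_emb.
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = text_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One batch of (style-transferred) pseudo-pairs per caption style per step.
style_batches = {
    "style_a": (torch.randn(8, 512), torch.randn(8, 512)),
    "style_b": (torch.randn(8, 512), torch.randn(8, 512)),
}
loss = sum(contrastive_loss(t, v) for t, v in style_batches.values())
print(loss.item())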
WEAR: A Multimodal Dataset for Wearable and Egocentric Video Activity Recognition
Though research has shown the complementarity of camera- and inertial-based
data, datasets which offer both modalities remain scarce. In this paper we
introduce WEAR, a multimodal benchmark dataset for both vision- and
wearable-based Human Activity Recognition (HAR). The dataset comprises data
from 18 participants performing a total of 18 different workout activities with
untrimmed inertial (acceleration) and camera (egocentric video) data recorded
at 10 different outdoor locations. WEAR features a diverse set of activities
that are low in inter-class similarity and, unlike previous egocentric
datasets, are neither defined by human-object interactions nor drawn from
inherently distinct activity categories. The provided benchmark results reveal that
single-modality architectures have different strengths and weaknesses in their
prediction performance. Further, in light of the recent success of
transformer-based video action detection models, we demonstrate their
versatility by applying them in a plain fashion using vision, inertial and
combined (vision + inertial) features as input. Results show that vision transformers not only produce competitive results using only inertial data, but can also serve as an architecture for fusing both modalities by simple concatenation, with the multimodal approach yielding the highest average mAP, the highest precision, and close-to-best F1-scores. Up until now, vision-based transformers have been explored in neither inertial nor multimodal human activity recognition, making our approach the first to do so. The dataset and code to reproduce our experiments are publicly available via mariusbock.github.io/wear
Comment: 12 pages, 2 figures, 2 tables
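A minimal sketch of the fusion-by-concatenation strategy mentioned above: per-timestep vision and inertial feature vectors are concatenated and fed to a small transformer encoder with a per-frame classification head. The feature dimensions and the tiny encoder below are assumptions for illustration and do not reflect the actual WEAR benchmark models.

import torch
import torch.nn as nn

T = 128                                  # timesteps in one clip
vision_feats = torch.randn(T, 2048)      # e.g. per-frame video features
inertial_feats = torch.randn(T, 128)     # e.g. windowed acceleration features

# Fusion by simple concatenation along the feature dimension.
fused = torch.cat([vision_feats, inertial_feats], dim=-1)        # (T, 2176)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=2176, nhead=8, batch_first=True),
    num_layers=2,
)
head = nn.Linear(2176, 18)               # 18 workout activities in WEAR
per_frame_logits = head(encoder(fused.unsqueeze(0)))
print(per_frame_logits.shape)            # torch.Size([1, 128, 18])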